Code to clean the data file-by-file

Importing the necessary libraries

In [1]:
import pandas as pd
import csv
import string
import re
import nltk

nltk.download('stopwords')
nltk.download('names')
from nltk.corpus import stopwords
from nltk.corpus import names
from nltk import word_tokenize
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

%matplotlib inline
pd.set_option('display.max_colwidth', 150)

(A) Read the CSV File

In [3]:
df = pd.read_csv("C:\\Users\\Aruna\\Documents\\input\\Amazon S3.csv")

df['description'] = df['description'].apply(lambda x: " ".join(x for x in str(x).split())) # converting to string
 
df.head(10)
Out[3]:
id label description
0 16136 Amazon S3 Deleted file returns 'Access Denied', but I need 404 I found a really old file (https://s3.amazonaws.com/bucket/file.pdf) showing up in search eng...
1 16136 Amazon S3 Hi, As you mentioned that you are trying to get a 404 error for a non existing file in S3 bucket. I would like to inform that if your S3 bucket po...
2 16135 Amazon S3 S3 get request I see the following information in the Billing Management Console; Amazon Simple Storage Service 20,000 Get Requests of Amazon S3 7...
3 16135 Amazon S3 Hi, You can see these requests if you enable Server Access Logging for this S3 bucket. Please refer to the following link for more information abo...
4 16134 Amazon S3 Can't create static site s3 bucket for my domain as name already exists! Hi so I'm trying to setup a static website for my domain example.com. I c...
5 16134 Amazon S3 Hi, Please know that S3 bucket names are unique globally. Therefore, if a bucket name is existing already, you won’t be able to create another buc...
6 16133 Amazon S3 Bucket policy to allow access to files only through website's static IP I cannot figure out how to ONLY allow access to files from my website's st...
7 16133 Amazon S3 Hi, Please find the following S3 bucket policy example to restrict the access to S3 objects from IP addresses. https://docs.aws.amazon.com/AmazonS...
8 16132 Amazon S3 SecureTransport policy causing 403 forbidden for versioned file I am hosting a static website using AWS S3 which sits behind a Cloudfront distribu...
9 16131 Amazon S3 What IAM Permissions are needed to do a CreateJob for S3 Batch? Could you give attention to this question please: https://forums.aws.amazon.com/th...
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53443 entries, 0 to 53442
Data columns (total 3 columns):
id             53443 non-null int64
label          53443 non-null object
description    53443 non-null object
dtypes: int64(1), object(2)
memory usage: 1.2+ MB

Check out one sample post:

In [7]:
p = 200

df['description'][p]
Out[7]:
'and this too successfully saved but the error still there { "Version":"2008-10-17", "Statement":[{ "Sid":"AddCannedAcl", "Effect":"Allow", "Principal": { "AWS": }, "Action":["s3:PutObject","s3:PutObjectAcl" ], "Resource":["arn:aws:s3:::bucket/*" ], "Condition":{ "StringEquals":{ "s3:x-amz-acl": } } } ] } Client.InvalidParameterValue: Could not read the ACL associated with the S3 bucket.'

Top 30 words + frequency of each:

In [8]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[8]:
the       165637
to        136006
I          80817
a          77433
and        61970
is         54068
of         46770
in         45622
for        44518
you        42271
that       42144
S3         34632
it         33301
on         31233
this       31108
have       28459
with       27072
be         26851
not        23702
can        22272
my         21874
are        20950
bucket     20393
your       19848
from       18716
but        18387
as         18245
an         17771
-          15432
or         15349
dtype: int64
In [9]:
print("There are totally", df['description'].apply(lambda x: len(x.split(' '))).sum(), "words before cleaning.")
There are totally 4244537 words before cleaning.

(B) Text Pre-processing

In [10]:
STOPWORDS = stopwords.words('english')
my_stop_words = ["hi", "hello", "regards", "thank", "thanks", "regard", "best", "wishes", "hey", "amazon", "aws", "s3",
"elastic", "beanstalk", "rds", "ec2", "lambda", "cloudfront", "cloud", "front", "vpc", "sns", "me",
"january", "february", "march", "april", "may", "june", "july", "august", "september", "october", 
"november", "december", "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept", "oct", "nov",
"dec", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday", "mon", "tue",
"wed", "thu", "fri", "sat", "sun", "ain't", "aren't", "can't", "can't've", "'cause", "could've", "couldn't",
"couldn't've", "didn't", "doesn't", "don't", "hadn't", "hadn't've", "hasn't", "haven't", "he'd", "he'd've",
"he'll", "he'll've", "he's", "how'd", "how'd'y", "how'll", "how's", "i'd", "i'd've", "i'll", "i'll've", "i'm",
"i've", "isn't", "it'd", "it'd've", "it'll", "it'll've", "it's", "let's", "mayn't", "might've", "mightn't",
"mightn't've", "must've", "mustn't", "mustn't've", "needn't", "needn't've", "oughtn't", "oughtn't've", "shan't",
"sha'n't", "shan't've", "she'd", "she'd've", "she'll", "she'll've", "she's", "should've", "shouldn't", "shouldn't've",
"so've", "so's", "that'd", "that'd've", "that's", "there'd", "there'd've", "there's", "they'd", "they'd've", "they'll",
"they'll've", "they're", "they've", "to've", "wasn't", "we'd", "we'd've", "we'll", "we'll've", "we're", "we've",
"weren't", "what'll", "what'll've", "what're", "what's", "what've", "when's", "when've", "where'd", "where's",
"where've", "who'll", "who'll've", "who's", "who've", "why's", "why've", "will've", "won't", "won't've", "would've",
"wouldn't", "wouldn't've", "yall", "yalld", "yalldve", "yallre", "yallve", "youd", "youdve", "youll",
"youllve", "youre", "youve", "do", "did", "does", "had", "have", "has", "could", "can", "as", "is",
"shall", "should", "would", "will", "you", "me", "please", "know", "who", "we", "was", "were", "edited", "by", "pm"]

name = names.words()
STOPWORDS.extend(my_stop_words)
STOPWORDS.extend(name)

REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,:;#+?]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z - _.]+')
REMOVE_HTML_RE = re.compile(r'<.*?>')
REMOVE_HTTP_RE = re.compile(r'http\S+')

STOPWORDS = [BAD_SYMBOLS_RE.sub('', x) for x in STOPWORDS]

Convert to lowercase

In [11]:
df['description'] = df['description'].apply(lambda x: " ".join(x.lower() for x in str(x).split(" ")))

df['description'][p]
Out[11]:
'and this too successfully saved but the error still there { "version":"2008-10-17", "statement":[{ "sid":"addcannedacl", "effect":"allow", "principal": { "aws": }, "action":["s3:putobject","s3:putobjectacl" ], "resource":["arn:aws:s3:::bucket/*" ], "condition":{ "stringequals":{ "s3:x-amz-acl": } } } ] } client.invalidparametervalue: could not read the acl associated with the s3 bucket.'

Remove all HTML tags

In [12]:
df['description'] = df['description'].apply(lambda x: " ".join(REMOVE_HTML_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[12]:
'and this too successfully saved but the error still there { "version":"2008-10-17", "statement":[{ "sid":"addcannedacl", "effect":"allow", "principal": { "aws": }, "action":["s3:putobject","s3:putobjectacl" ], "resource":["arn:aws:s3:::bucket/*" ], "condition":{ "stringequals":{ "s3:x-amz-acl": } } } ] } client.invalidparametervalue: could not read the acl associated with the s3 bucket.'
In [13]:
df['description'] = df['description'].apply(lambda x: " ".join(REMOVE_HTTP_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[13]:
'and this too successfully saved but the error still there { "version":"2008-10-17", "statement":[{ "sid":"addcannedacl", "effect":"allow", "principal": { "aws": }, "action":["s3:putobject","s3:putobjectacl" ], "resource":["arn:aws:s3:::bucket/*" ], "condition":{ "stringequals":{ "s3:x-amz-acl": } } } ] } client.invalidparametervalue: could not read the acl associated with the s3 bucket.'

Replace certain characters by space (quotation marks, parantheses etc)

In [14]:
df['description'] = df['description'].apply(lambda x: " ".join(REPLACE_BY_SPACE_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[14]:
'and this too successfully saved but the error still there   "version" "2008-10-17"  "statement"    "sid" "addcannedacl"  "effect" "allow"  "principal"    "aws"     "action"  "s3 putobject" "s3 putobjectacl"    "resource"  "arn aws s3   bucket *"    "condition"   "stringequals"   "s3 x-amz-acl"            client.invalidparametervalue  could not read the acl associated with the s3 bucket.'

Remove any unwanted symbols (like $, @ etc)

In [15]:
df['description'] = df['description'].apply(lambda x: " ".join(BAD_SYMBOLS_RE.sub('', x) for x in str(x).split()))

df['description'][p]
Out[15]:
'and this too successfully saved but the error still there version 20081017 statement sid addcannedacl effect allow principal aws action s3 putobject s3 putobjectacl resource arn aws s3 bucket  condition stringequals s3 xamzacl client.invalidparametervalue could not read the acl associated with the s3 bucket.'

Remove trailing punctuation marks and any symbol patterns

In [16]:
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('.') for x in x.split()))
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('-') for x in x.split()))
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('_') for x in x.split()))
df['description'][p]
Out[16]:
'and this too successfully saved but the error still there version 20081017 statement sid addcannedacl effect allow principal aws action s3 putobject s3 putobjectacl resource arn aws s3 bucket condition stringequals s3 xamzacl client.invalidparametervalue could not read the acl associated with the s3 bucket'

Remove any numbers

In [17]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if not x.isdigit()))

df['description'][p]
Out[17]:
'and this too successfully saved but the error still there version statement sid addcannedacl effect allow principal aws action s3 putobject s3 putobjectacl resource arn aws s3 bucket condition stringequals s3 xamzacl client.invalidparametervalue could not read the acl associated with the s3 bucket'

Remove the stop words

In [18]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if x not in STOPWORDS
                                                               and len(x) > 1))

df['description'][p]
Out[18]:
'successfully saved error still version statement sid addcannedacl effect allow principal action putobject putobjectacl resource bucket condition stringequals xamzacl client.invalidparametervalue read acl associated bucket'

Results after cleaning data:

In [19]:
df.head()
Out[19]:
id label description
0 16136 Amazon S3 deleted file returns access denied need found really old file showing search engines need removed delete file returns xml error access denied litt...
1 16136 Amazon S3 mentioned trying get error non existing file bucket like inform bucket policy give listbucket permissions everyone getting error place moving forw...
2 16135 Amazon S3 get request see following information billing management console simple storage service get requests 79.62 924.00 requests however see get request...
3 16135 Amazon S3 see requests enable server access logging bucket refer following link information enabling server access logging bucket
4 16134 Amazon S3 create static site bucket domain name already exists trying setup static website domain example.com created bucket name www.example.com worked how...

Top 30 words + frequency of each:

In [20]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[20]:
bucket           34085
file             18896
files            17848
using            16604
get              13399
use              12723
access           12557
error            11224
upload           11026
span             10122
request           9682
like              9533
data              9483
one               8941
new               8768
object            7714
time              7710
need              7452
account           7415
server            7395
see               7338
problem           7301
also              7206
stylefontsize     6548
way               6539
help              6538
issue             6354
buckets           6227
want              6189
ms                6179
dtype: int64
In [21]:
print("There are totally", df['description'].apply(lambda x: len(x.split(' '))).sum(), "words after cleaning.")
There are totally 2111826 words after cleaning.

(C) Write to CleanText.csv

In [22]:
with open('C:\\Users\\Aruna\\Documents\\ACMS-IID\\input\\CleanText.csv', 'a', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # writer.writerow(['id', 'label', 'description'])
    for i in range(0, len(df['description'])):
        if len(df['description'][i]) > 1:
            writer.writerow([df['id'][i], df['label'][i], df['description'][i]])

(D) Generate the word cloud

In [23]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 20, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[23]:
(-0.5, 399.5, 199.5, -0.5)
In [24]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 50, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[24]:
(-0.5, 399.5, 199.5, -0.5)
In [25]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 100, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[25]:
(-0.5, 399.5, 199.5, -0.5)
In [26]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 500, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[26]:
(-0.5, 399.5, 199.5, -0.5)
In [27]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 1000, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[27]:
(-0.5, 399.5, 199.5, -0.5)
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: